Building Reliable LLM Applications, Production-Ready RAG, Data-Driven Evals | ep 5

Update: 2024-05-03

Description

In this episode of "How AI is Built", we learn how to build and evaluate real-world language model applications with Shahul and Jithin, creators of Ragas. Ragas is a powerful open-source library that helps developers test, evaluate, and fine-tune Retrieval Augmented Generation (RAG) applications, streamlining their path to production readiness.

Main Insights

Challenges of Open-Source Models: Open-source large language models (LLMs) can be powerful tools, but they require significant post-training optimization for specific use cases.

Evaluation Before Deployment: Thorough testing and evaluation are key to preventing unexpected behaviors and hallucinations in deployed RAGs. Ragas offers metrics and synthetic data generation to support this process.

Data is Key: The quality and distribution of data used to train and evaluate LLMs dramatically impact their performance. Ragas is enabling novel synthetic data generation techniques to make this process more effective and cost-efficient.

RAG Evolution: Techniques for improving RAGs are continuously evolving. Developers must be prepared to experiment and keep up with the latest advancements in chunk embedding, query transformation, and model alignment.
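To make the "evaluate before deployment" idea concrete, here is a minimal sketch of a release gate that compares per-metric scores against minimum thresholds. The metric names, the threshold values, and the `check_release` helper are hypothetical and chosen only for this example. The scores are assumed to come from whatever evaluation tooling you use (Ragas or otherwise) and to lie between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failures: dict  # metric name -> (observed score, required minimum)


def check_release(scores: dict, thresholds: dict) -> GateResult:
    """Compare evaluation scores against minimum thresholds.

    A metric that has a threshold but no score counts as a failure,
    so a silently skipped evaluation cannot pass the gate.
    """
    failures = {}
    for name, minimum in thresholds.items():
        observed = scores.get(name)
        if observed is None or observed < minimum:
            failures[name] = (observed, minimum)
    return GateResult(passed=not failures, failures=failures)


# Hypothetical thresholds and scores, for illustration only.
thresholds = {"faithfulness": 0.85, "answer_relevancy": 0.80}
scores = {"faithfulness": 0.91, "answer_relevancy": 0.74}

result = check_release(scores, thresholds)
print(result.passed)
print(result.failures)
```

Treating a missing score as a failure is a deliberate choice in this sketch: it keeps the gate strict when an evaluation step is skipped or crashes.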

Practical Takeaways

Start with a solid testing strategy: Before launching, define the quality metrics aligned with your RAG's purpose. Ragas helps in this process.

Embrace synthetic data: Manually creating test data sets is time-consuming. Tools within Ragas help automate the creation of synthetic data to mirror real-world use cases.

RAGs are iterative: Be prepared for continuous improvement as better techniques and models emerge.
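The synthetic data takeaway can be sketched roughly as follows. This is not Ragas' implementation: the question types, their shares, and the `plan_question_types` and `build_testset` helpers are assumptions made for illustration, and `write_question` stands in for whatever LLM call would actually write each question. The shares are assumed to sum to 1.

```python
import random
from typing import Callable

# Hypothetical question styles and their target share of the test set.
QUESTION_MIX = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}


def plan_question_types(test_size: int, mix: dict, seed: int = 0) -> list:
    """Return a shuffled list of question types that follows `mix` as closely as possible."""
    exact = {name: share * test_size for name, share in mix.items()}
    counts = {name: int(value) for name, value in exact.items()}
    # Hand out leftover slots to the types with the largest fractional parts.
    leftover = test_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    plan = [name for name, count in counts.items() for _ in range(count)]
    random.Random(seed).shuffle(plan)
    return plan


def build_testset(chunks: list, write_question: Callable[[str, str], str],
                  test_size: int, mix: dict = QUESTION_MIX, seed: int = 0) -> list:
    """Pair each planned question type with a document chunk and a generated question."""
    plan = plan_question_types(test_size, mix, seed)
    samples = []
    for i, qtype in enumerate(plan):
        chunk = chunks[i % len(chunks)]  # cycle through chunks so each one is covered
        samples.append({
            "type": qtype,
            "context": chunk,
            "question": write_question(chunk, qtype),
        })
    return samples


def placeholder_writer(chunk: str, qtype: str) -> str:
    """Placeholder for an LLM call; a real writer would prompt a model here."""
    return f"[{qtype}] What does this passage say? {chunk[:30]}"


chunks = ["Ragas is an open-source library.", "RAG combines retrieval with generation."]
testset = build_testset(chunks, placeholder_writer, test_size=8)
print(len(testset))
```

Fixing the mix up front is one simple way to keep a generated test set from collapsing into a single kind of question; generated samples would still need a human spot check before being trusted.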

Interesting Quotes

"...models are very stochastic and grading it directly would rather trigger them to give some random number..." - Shahul, on the dangers of naive model evaluation.

"Reducing the developer time in acquiring these test data sets by 90%." - Shahul, on the efficiency gains of Ragas' synthetic data generation.

"We want to ensure maximum diversity..." - Shahul, on creating realistic and challenging test data for RAG evaluation.
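The first quote warns against asking a model for a raw numeric grade. One common alternative, sketched here as an assumption rather than as the method described in the episode, is to ask for a yes/no verdict on each individual statement and compute the score arithmetically. The `supported_fraction` helper and the toy `keyword_judge` are hypothetical; in practice the judge would be an LLM call that returns a binary verdict.

```python
from typing import Callable, Iterable


def supported_fraction(statements: Iterable[str], context: str,
                       judge: Callable[[str, str], bool]) -> float:
    """Fraction of statements the judge marks as supported by the context."""
    statements = list(statements)
    if not statements:
        return 0.0
    verdicts = [bool(judge(statement, context)) for statement in statements]
    return sum(verdicts) / len(verdicts)


def keyword_judge(statement: str, context: str) -> bool:
    """Toy stand-in for an LLM judge: every word must appear in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in statement.lower().split())


context = "ragas is an open-source library for evaluating rag applications"
statements = ["ragas is an open-source library", "ragas is a database"]

score = supported_fraction(statements, context, keyword_judge)
print(score)  # 0.5
```

Because the final number is a ratio of binary verdicts, each verdict can be inspected on its own, which makes a surprising score easier to debug than a single opaque grade.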

Ragas:

Docs

Jithin James:

Shahul ES:

X (Twitter)

Nicolay Gerold:

LinkedIn

X (Twitter)

00:00 Introduction

02:03 Introduction to Open Assistant project

04:05 Creating Customizable and Fine-Tunable Models

06:07 Ragas and the LLM Use Case

08:09 Introduction to Language Model Metrics

11:12 Reducing the Cost of Data Generation

13:19 Evaluation of Components at Melvess

15:40 Combining Ragas Metrics with AutoML Providers

20:08 Improving Performance with Fine-tuning and Reranking

22:56 End-to-End Metrics and Component-Specific Metrics

25:14 The Importance of Deep Knowledge and Understanding

25:53 Robustness vs Optimization

26:32 Challenges of Evaluating Models

27:18 Creating a Dream Tech Stack

27:47 The Future Roadmap for Ragas

28:02 Doubling Down on Grid Data Generation

28:12 Open-Source Models and Expanded Support

28:20 More Metrics for Different Applications

RAG, Ragas, LLM, Evaluation, Synthetic Data, Open-Source, Language Model Applications, Testing.


